The Unsupervised Acquisition of a Lexicon from Continuous Speech
Author
Abstract
We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.

Copyright © Massachusetts Institute of Technology, 1995. This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research is supported by NSF grant 9217041-ASC and ARPA under the HPCC and AASERT programs.
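The MDL idea named in the abstract can be illustrated with a toy two-part code. The sketch below is not the report's algorithm: it has no hierarchical lexicon and no acoustic model. It assumes a flat per-character cost for spelling out lexicon entries and a unigram code for the corpus. The function name `description_length` and the parameter `bits_per_char` are hypothetical and were chosen only for this illustration.

```python
import math
from collections import Counter


def description_length(segmented_corpus, bits_per_char=5.0):
    """Toy two-part code length in bits: lexicon cost plus corpus cost.

    segmented_corpus: list of utterances, each a list of string tokens.
    bits_per_char: assumed flat cost of one character in the lexicon.
    """
    counts = Counter(tok for utt in segmented_corpus for tok in utt)
    total = sum(counts.values())
    # Lexicon cost: spell each distinct entry, plus one end-of-entry marker.
    lexicon_bits = sum((len(word) + 1) * bits_per_char for word in counts)
    # Corpus cost: each token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


# Two candidate segmentations of the same unsegmented text.
word_level = [["the", "dog", "the", "cat"]] * 20
char_level = [list("thedogthecat")] * 20

word_bits = description_length(word_level)
char_bits = description_length(char_level)
print(f"word-level: {word_bits:.1f} bits, char-level: {char_bits:.1f} bits")
```

Under these assumptions the word-level segmentation yields the shorter total code, because the extra bits spent on a larger lexicon are repaid by a much cheaper encoding of the corpus. A full learner would search over candidate lexicons rather than compare two fixed segmentations.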
Similar resources
Learning Part-of-Speech Guessing Rules from Lexicon: Extension to Non-Concatenative Operations
One of the problems in part-of-speech tagging of real-world texts is that of words unknown to the lexicon. In (Mikheev, 1996), a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words was proposed. One of the over-simplifications assumed by this learning technique was the acquisition of morphological rules which obey only simple con...
Unsupervised Learning of Word-Category Guessing Rules
Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-gu...
A procedure for unsupervised lexicon learning
We describe an incremental unsupervised procedure to learn words from transcribed continuous speech. The algorithm is based on a conservative and traditional statistical model, and results of empirical tests show that it is competitive with other algorithms that have been proposed recently for this task.
On-line learning of acoustic and lexical units for domain-independent ASR
We are interested in on-line acquisition of acoustic, lexical and semantic units from spontaneous speech. Traditional ASR techniques require domain-specific knowledge of acoustic and lexicon data and, more importantly, of the word probability distributions. In this paper we propose an algorithm for unsupervised learning of acoustic and lexical units from out-of-domain speech data. The new lexical un...
A Simple Unsupervised Learner for POS Disambiguation Rules Given Only a Minimal Lexicon
We propose a new model for unsupervised POS tagging based on linguistic distinctions between open and closed-class items. Exploiting notions from current linguistic theory, the system uses far less information than previous systems, far simpler computational methods, and far sparser descriptions in learning contexts. By applying simple language acquisition techniques based on counting, the syst...
Journal title:
- CoRR
Volume abs/cmp-lg/9512002, issue
Pages -
Publication date: 1995